Kaggle Competition


Project 1: Bag of Words Meets Bags of Popcorn



Dataset visualization and pre-processing


Import packages


In [1]:
import pandas as pd
from bs4 import BeautifulSoup
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import roc_auc_score,roc_curve
from sklearn.decomposition import TruncatedSVD 
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plt
import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\workshop\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\workshop\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\workshop\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
Out[1]:
True

Import dataset


In [2]:
train_data=pd.read_csv("../../../data-project1/labeledTrainData.tsv", header=0, delimiter="\t", quoting=3)
train_data.head()


Out[2]:
id sentiment review
0 "5814_8" 1 "With all this stuff going down at the moment ...
1 "2381_9" 1 "\"The Classic War of the Worlds\" by Timothy ...
2 "7759_3" 0 "The film starts with a manager (Nicholas Bell...
3 "3630_4" 0 "It must be assumed that those who praised thi...
4 "9495_8" 1 "Superbly trashy and wondrously unpretentious ...

In [3]:
train_data.tail()


Out[3]:
id sentiment review
24995 "3453_3" 0 "It seems like more consideration has gone int...
24996 "5064_1" 0 "I don't believe they made this film. Complete...
24997 "10905_3" 0 "Guy is a loser. Can't get girls, needs to bui...
24998 "10194_3" 0 "This 30 minute documentary Buñuel made in the...
24999 "8478_8" 1 "I saw this movie as a child and it broke my h...

Notice that 'sentiment' is binary


In [4]:
train_data.dtypes


Out[4]:
id           object
sentiment     int64
review       object
dtype: object

Type 'object' is a string for pandas. We shall later convert the text to a numerical representation, perhaps using a typical bag-of-words model or word2vec.

Let's start by getting basic information about the data:


In [5]:
train_data.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25000 entries, 0 to 24999
Data columns (total 3 columns):
id           25000 non-null object
sentiment    25000 non-null int64
review       25000 non-null object
dtypes: int64(1), object(2)
memory usage: 390.7+ KB

Now that we have a general idea of the data set, we can clean and transform the data to create useful features for machine learning.


First Attempt Summary


  • Feature 'review'
    • Processing raw text
    • Transforming feature 'review': bag-of-words model
    • Extending bag-of-words with TF-IDF weights
    • Dimensionality reduction
  • Training Naive Bayes
  • Predicting with Naive Bayes
  • Preparing for Kaggle submission
  • Performance Evaluation
    • Splitting train data set
    • Evaluating performance using the split data set
    • Plotting ROC curve
  • Hyperparameters
  • Other improvements

Feature 'review'

Processing raw text

We will start by writing a function for analyzing and cleaning the feature 'review', using the first review as an illustration.


In [6]:
train_data.review[0]


Out[6]:
'"With all this stuff going down at the moment with MJ i\'ve started listening to his music, watching the odd documentary here and there, watched The Wiz and watched Moonwalker again. Maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent. Moonwalker is part biography, part feature film which i remember going to see at the cinema when it was originally released. Some of it has subtle messages about MJ\'s feeling towards the press and also the obvious message of drugs are bad m\'kay.<br /><br />Visually impressive but of course this is all about Michael Jackson so unless you remotely like MJ in anyway then you are going to hate this and find it boring. Some may call MJ an egotist for consenting to the making of this movie BUT MJ and most of his fans would say that he made it for the fans which if true is really nice of him.<br /><br />The actual feature film bit when it finally starts is only on for 20 minutes or so excluding the Smooth Criminal sequence and Joe Pesci is convincing as a psychopathic all powerful drug lord. Why he wants MJ dead so bad is beyond me. Because MJ overheard his plans? Nah, Joe Pesci\'s character ranted that he wanted people to know it is he who is supplying drugs etc so i dunno, maybe he just hates MJ\'s music.<br /><br />Lots of cool things in this like MJ turning into a car and a robot and the whole Speed Demon sequence. Also, the director must have had the patience of a saint when it came to filming the kiddy Bad sequence as usually directors hate working with one kid let alone a whole bunch of them performing a complex dance scene.<br /><br />Bottom line, this movie is for people who like MJ on one level or another (which i think is most people). If not, then stay away. It does try and give off a wholesome message and ironically MJ\'s bestest buddy in this movie is a girl! Michael Jackson is truly one of the most talented people ever to grace this planet but is he guilty? Well, with all the attention i\'ve gave this subject....hmmm well i don\'t know because people can be different behind closed doors, i know this for a fact. He is either an extremely nice but stupid guy or one of the most sickest liars. I hope he is not the latter."'

Before we can transform the text into a numerical representation, we need to process the raw text. Let's first remove HTML and punctuation.


In [7]:
soup=BeautifulSoup(train_data.review[0], "html5lib").get_text()
letters_only = re.sub("[^a-zA-Z]"," ",soup )
letters_only


Out[7]:
' With all this stuff going down at the moment with MJ i ve started listening to his music  watching the odd documentary here and there  watched The Wiz and watched Moonwalker again  Maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent  Moonwalker is part biography  part feature film which i remember going to see at the cinema when it was originally released  Some of it has subtle messages about MJ s feeling towards the press and also the obvious message of drugs are bad m kay Visually impressive but of course this is all about Michael Jackson so unless you remotely like MJ in anyway then you are going to hate this and find it boring  Some may call MJ an egotist for consenting to the making of this movie BUT MJ and most of his fans would say that he made it for the fans which if true is really nice of him The actual feature film bit when it finally starts is only on for    minutes or so excluding the Smooth Criminal sequence and Joe Pesci is convincing as a psychopathic all powerful drug lord  Why he wants MJ dead so bad is beyond me  Because MJ overheard his plans  Nah  Joe Pesci s character ranted that he wanted people to know it is he who is supplying drugs etc so i dunno  maybe he just hates MJ s music Lots of cool things in this like MJ turning into a car and a robot and the whole Speed Demon sequence  Also  the director must have had the patience of a saint when it came to filming the kiddy Bad sequence as usually directors hate working with one kid let alone a whole bunch of them performing a complex dance scene Bottom line  this movie is for people who like MJ on one level or another  which i think is most people   If not  then stay away  It does try and give off a wholesome message and ironically MJ s bestest buddy in this movie is a girl  Michael Jackson is truly one of the most talented people ever to grace this planet but is he guilty  Well  with all the attention i ve gave this subject    hmmm well i don t know because people can be different behind closed doors  i know this for a fact  He is either an extremely nice but stupid guy or one of the most sickest liars  I hope he is not the latter  '

Now we can start stemming and lemmatizing the text, but it is generally better to run a POS tagger first, as we only want to stem and lemmatize verbs and nouns.


In [8]:
tokens=nltk.word_tokenize(letters_only.lower())
tagged_words=nltk.pos_tag(tokens)
tagged_words[0:5]


Out[8]:
[('with', 'IN'),
 ('all', 'PDT'),
 ('this', 'DT'),
 ('stuff', 'NN'),
 ('going', 'VBG')]

Stemming the text: there are two commonly used stemmers available in nltk, Porter and Lancaster.


In [9]:
porter=nltk.PorterStemmer()
def lemmatize_with_potter(token,tag):
    #stem only verbs and nouns (POS tags starting with 'V' or 'N')
    if tag[0].lower() in ['v','n']:
        return porter.stem(token)
    return token
stemmed_text_with_potter=[lemmatize_with_potter(token,tag) for token,tag in tagged_words]

lancaster=nltk.LancasterStemmer()
def lemmatize_with_lancaster(token,tag):
    if tag[0].lower() in ['v','n']:
        return lancaster.stem(token)
    return token
stemmed_text_with_lancaster=[lemmatize_with_lancaster(token,tag) for token,tag in tagged_words]

In [10]:
stemmed_text_with_potter[0:10]


Out[10]:
['with',
 'all',
 'this',
 'stuff',
 'go',
 'down',
 'at',
 'the',
 'moment',
 'with']

In [11]:
stemmed_text_with_lancaster[0:10]


Out[11]:
['with',
 'all',
 'this',
 'stuff',
 'going',
 'down',
 'at',
 'the',
 'mom',
 'with']

Observing that the word 'going' has been stemmed by Porter but not by Lancaster, I'll choose Porter for this task.
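
To double-check, here is a quick side-by-side run of the two stemmers on a few sample words (a small sketch; the word list is mine):

for word in ['going', 'running', 'caresses', 'maximum']:
    print(word, porter.stem(word), lancaster.stem(word))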

Let's lemmatize:


In [12]:
tagged_words_after_stem=nltk.pos_tag(stemmed_text_with_potter)
wnl = nltk.WordNetLemmatizer()
def lemmatize_with_WordNet(token,tag):
    if tag[0].lower() in ['v','n']:
        return wnl.lemmatize(token)
    return token
stemmed_and_lemmatized_text=[lemmatize_with_WordNet(token,tag) for token,tag in tagged_words_after_stem]
stemmed_and_lemmatized_text[0:10]


Out[12]:
['with',
 'all',
 'this',
 'stuff',
 'go',
 'down',
 'at',
 'the',
 'moment',
 'with']

Text cleaning summary


In [13]:
porter=nltk.PorterStemmer()
wnl = nltk.WordNetLemmatizer()

def stemmatize_with_potter(token,tag):
    if tag[0].lower() in ['v','n']:
        return  porter.stem(token)
    return token


def lemmatize_with_WordNet(token,tag):
    if tag[0].lower() in ['v','n']:
        return wnl.lemmatize(token)
    return token

def corpus_preprocessing(corpus):
    preprocessed_corpus = []
    for sentence in corpus:
        #remove HTML and punctuation
        soup=BeautifulSoup(sentence, "html5lib").get_text()
        letters_only = re.sub("[^a-zA-Z]"," ",soup)

        #stemming
        tokens=nltk.word_tokenize(letters_only.lower())
        tagged_words=nltk.pos_tag(tokens)
        stemmed_text_with_potter=[stemmatize_with_potter(token,tag) for token,tag in tagged_words]

        #lemmatization
        tagged_words_after_stem=nltk.pos_tag(stemmed_text_with_potter)
        stemmed_and_lemmatized_text=[lemmatize_with_WordNet(token,tag) for token,tag in tagged_words_after_stem]

        #join all the tokens back into a single string
        clean_review=" ".join(stemmed_and_lemmatized_text)
        preprocessed_corpus.append(clean_review)

    return preprocessed_corpus

Transforming feature 'review': bag-of-words model

Let's transform the feature 'review' into a numerical representation to feed into machine learning. The most common representation of text is the bag-of-words model.

In sklearn, we can use the CountVectorizer class to transform the data. We shall also use stop words to reduce the dimension of the feature space. For now, let's take the first 5 reviews from the train data set as the test_corpus.


In [14]:
vectorizer=CountVectorizer(stop_words='english')
test_corpus=train_data.review[0:5]
test_corpus= corpus_preprocessing(test_corpus)
test_corpus=vectorizer.fit_transform(test_corpus)
print(test_corpus.todense())


[[0 0 1 ... 0 0 0]
 [0 0 0 ... 1 0 0]
 [0 1 0 ... 0 1 1]
 [0 1 0 ... 0 0 0]
 [1 1 1 ... 1 0 0]]
Extending bag-of-words with TF-IDF weights

We could extend the bag-of-words representation with tf-idf weights to reflect how important a word is to a document in the corpus.

tf-idf can be applied with the TfidfVectorizer class in sklearn.
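
As a quick illustration on a toy corpus of my own: with sklearn's default smoothing, idf(t) = ln((1 + n) / (1 + df(t))) + 1, so a word that appears in every document gets the lowest weight:

toy_corpus = ["the cat sat", "the dog sat", "the cat ran"]
toy_vectorizer = TfidfVectorizer()
toy_vectorizer.fit(toy_corpus)
#'the' occurs in all three documents, so its idf is the smallest
print(dict(zip(toy_vectorizer.get_feature_names(), toy_vectorizer.idf_)))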


In [16]:
vectorizer= TfidfVectorizer(stop_words='english')
test_corpus=train_data.review[0:5]
test_corpus= corpus_preprocessing(test_corpus)
test_corpus=vectorizer.fit_transform(test_corpus)
print (test_corpus.todense())


[[0.         0.         0.0416204  ... 0.         0.         0.        ]
 [0.         0.         0.         ... 0.08111972 0.         0.        ]
 [0.         0.0393955  0.         ... 0.         0.05882456 0.05882456]
 [0.         0.0452348  0.         ... 0.         0.         0.        ]
 [0.06650886 0.04454176 0.05365893 ... 0.05365893 0.         0.        ]]
Dimensionality reduction

Using stop words was one technique to reduce dimensionality. We can further reduce the dimensionality by using latent semantic analysis.

In sklearn, we can apply the TruncatedSVD class to the tf-idf matrix. Note that with only five documents in test_corpus the matrix has rank at most five, so only five components come back below even though we ask for 100.


In [17]:
tsvd=TruncatedSVD(100)
tsvd.fit(test_corpus)
test_corpus=tsvd.transform(test_corpus)
test_corpus


Out[17]:
array([[ 0.59640897,  0.14453626, -0.02568868, -0.1569847 ,  0.77337022],
       [ 0.42191081,  0.81260372,  0.05178786,  0.0606323 , -0.39409161],
       [ 0.45916278, -0.29326902, -0.25540027,  0.7938603 , -0.08785976],
       [ 0.47489155, -0.3237962 , -0.49040274, -0.5458162 , -0.36224836],
       [ 0.42144351, -0.33366922,  0.81536317, -0.08841504, -0.19599974]])
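
As a sanity check (my addition, not in the original run), the explained variance of the fitted components shows how much of the tf-idf signal the reduction keeps:

#fraction of the variance retained by the kept components
print(tsvd.explained_variance_ratio_.sum())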

Training Naive Bayes

Sklearn provides several kinds of Naive Bayes classifiers: GaussianNB, MultinomialNB, and BernoulliNB. We will choose MultinomialNB for this task.
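
One caveat worth flagging (my observation, not from the original write-up): MultinomialNB expects non-negative inputs and raises a ValueError on negative values, yet the TruncatedSVD output above clearly contains negatives. A minimal workaround sketch using MinMaxScaler would rescale the LSA features first:

from sklearn.preprocessing import MinMaxScaler

#rescale LSA components to [0, 1] so MultinomialNB will accept them
scaler = MinMaxScaler()
test_corpus_nonneg = scaler.fit_transform(test_corpus)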


In [18]:
model=MultinomialNB()
Fitting the training data

In [19]:
#features from train set
train_features=train_data.review

#pre-processing train features
train_features=corpus_preprocessing(train_features)
vectorizer= TfidfVectorizer(stop_words='english')
train_features=vectorizer.fit_transform(train_features)
tsvd=TruncatedSVD(100)
tsvd.fit(train_features)
train_features=tsvd.transform(train_features)

#target from train set 
train_target=train_data.sentiment

#fitting the model
model.fit(train_features,train_target)


---------------------------------------------------------------------------
KeyboardInterrupt                         Traceback (most recent call last)
<ipython-input-19-5a61e2137d6c> in <module>()
      3 
      4 #pro-processing train features
----> 5 train_features=corpus_preprocessing(train_features)
      6 vectorizer= TfidfVectorizer(stop_words='english')
      7 train_features=vectorizer.fit_transform(train_features)

<ipython-input-13-6ee6a6dbb5cf> in corpus_preprocessing(corpus)
     26 
     27         #lemmatization
---> 28         tagged_words_after_stem=nltk.pos_tag(stemmed_text_with_potter)
     29         stemmed_and_lemmatized_text=[lemmatize_with_WordNet(token,tag) for token,tag in tagged_words_after_stem]
     30 

c:\users\workshop\appdata\local\programs\python\python36-32\lib\site-packages\nltk\tag\__init__.py in pos_tag(tokens, tagset, lang)
    131     :rtype: list(tuple(str, str))
    132     """
--> 133     tagger = _get_tagger(lang)
    134     return _pos_tag(tokens, tagset, tagger)
    135 

c:\users\workshop\appdata\local\programs\python\python36-32\lib\site-packages\nltk\tag\__init__.py in _get_tagger(lang)
     95         tagger.load(ap_russian_model_loc)
     96     else:
---> 97         tagger = PerceptronTagger()
     98     return tagger
     99 

c:\users\workshop\appdata\local\programs\python\python36-32\lib\site-packages\nltk\tag\perceptron.py in __init__(self, load)
    138         self.classes = set()
    139         if load:
--> 140             AP_MODEL_LOC = 'file:'+str(find('taggers/averaged_perceptron_tagger/'+PICKLE))
    141             self.load(AP_MODEL_LOC)
    142 

c:\users\workshop\appdata\local\programs\python\python36-32\lib\site-packages\nltk\data.py in find(resource_name, paths)
    625 
    626         # Is the path item a directory or is resource_name an absolute path?
--> 627         elif not path_ or os.path.isdir(path_):
    628             if zipfile is None:
    629                 p = os.path.join(path_, url2pathname(resource_name))

KeyboardInterrupt: 

The cell was interrupted by hand: corpus_preprocessing runs nltk.pos_tag twice over every one of the 25,000 reviews, which makes this step very slow.
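
Since corpus_preprocessing dominates the runtime, one option (my suggestion; the file name is hypothetical) is to cache the cleaned corpus once and reuse it:

import pickle

#persist the cleaned reviews so later runs can skip the slow tagging step
clean_reviews = corpus_preprocessing(train_data.review)
with open("clean_train_reviews.pkl", "wb") as f:
    pickle.dump(clean_reviews, f)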

Predicting with Naive Bayes


In [ ]:
#reading test data
test_data=pd.read_csv("../../../data-project1/testData.tsv", header=0, delimiter="\t", quoting=3)

#features from test data
test_features=test_data.review

#pre-processing test features
test_features=corpus_preprocessing(test_features)
test_features=vectorizer.transform(test_features)
test_features=tsvd.transform(test_features)

#predicting the sentiment for test set
prediction=model.predict(test_features)

Preparing for Kaggle submission


In [ ]:
#writing out submission file
pd.DataFrame( data={"id":test_data["id"], "sentiment":prediction} ).to_csv("../../../data-project1/first_attempt.csv", index=False, quoting=3 )

Performance Evaluation

A variety of metrics exist to evaluate the performance of binary classifiers, e.g. accuracy, precision, recall, F1 measure, and ROC AUC score. We shall use the ROC AUC score for this task, as specified by the competition site.

Splitting train data set

We first split the train data set for cross-validation; let's use 80% for the split_train set and 20% for the split_test set.


In [ ]:
# Split 80-20 train vs test data
split_train_features, split_test_features, split_train_target, split_test_target = train_test_split(train_features, 
                                                                                                   train_target, 
                                                                                                   test_size = 0.20, 
                                                                                                   random_state = 0)
Evaluating the model using the split data set

The ROC curve illustrates the classifier's performance across all values of the discrimination threshold.


In [ ]:
#pre-processing split train 
vectorizer= TfidfVectorizer(stop_words='english')
split_train_features = corpus_preprocessing(split_train_features)
split_train_features = vectorizer.fit_transform(split_train_features)
tsvd=TruncatedSVD(100)
tsvd.fit(split_train_features)
split_train_features = tsvd.transform(split_train_features)

#pre-processing split test features
split_test_features = corpus_preprocessing(split_test_features)
split_test_features = vectorizer.transform(split_test_features)
split_test_features = tsvd.transform(split_test_features)

#fit and predict using split data
model = MultinomialNB()
model.fit(split_train_features,split_train_target)
split_prediction = model.predict(split_test_features)
score=roc_auc_score(split_test_target, split_prediction)
print(score)
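
Note that roc_auc_score on hard 0/1 predictions summarizes a single operating point; feeding it the positive-class probability, which MultinomialNB exposes via predict_proba, usually gives a more informative AUC:

#use class probabilities rather than hard labels for the AUC
split_probability = model.predict_proba(split_test_features)[:, 1]
print(roc_auc_score(split_test_target, split_probability))
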
Plotting ROC curve

ROC curves plot the classifier's recall against its fall-out.


In [ ]:
false_positive_rates,recall,thresholds=roc_curve(split_test_target,split_prediction)
plt.title('Receiver Operating Characteristic')
plt.plot(false_positive_rates,recall,'r', label='AUC = %0.2f' %score)
plt.legend(loc = 'lower right')
plt.ylabel('Recall')
plt.xlabel('False positive rate')
plt.show()

The source code of the first attempt can be found here and the evaluation script here.

Hyperparameters

The MultinomialNB class has a smoothing parameter alpha (default 1.0). We could try another value of alpha to see how the score changes.


In [ ]:
model=MultinomialNB(alpha=0.1)
model.fit(split_train_features,split_train_target)
split_prediction=model.predict(split_test_features)
score=roc_auc_score(split_test_target, split_prediction)
print(score)

Let's generate scores over a range of alpha values.


In [ ]:
alphas=np.logspace(-5,0,6)
print(alphas)

In [ ]:
def evaluate_alpha(train_features,train_target,test_features,test_target,model,parameter_values,parameter_name):
    scores=[]
    for test_alpha in parameter_values:
        model.set_params(**{parameter_name:test_alpha})
        model.fit(train_features,train_target)
        prediction=model.predict(test_features)
        score=roc_auc_score(test_target, prediction)
        scores.append((test_alpha,score))
    return scores

model=MultinomialNB()
alpha_score=evaluate_alpha(split_train_features,split_train_target,split_test_features,split_test_target,model,alphas,'alpha')
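
To see the trend, the (alpha, score) pairs returned above can be plotted on a logarithmic axis (a small sketch of mine):

#plot ROC AUC against alpha on a log-scaled x axis
alpha_values, auc_scores = zip(*alpha_score)
plt.semilogx(alpha_values, auc_scores)
plt.xlabel('alpha')
plt.ylabel('ROC AUC')
plt.show()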

Other improvements

  • For a numerical representation of text, the hashing trick is worth attempting due to its memory advantage. The HashingVectorizer class in sklearn provides this trick (see the sketch below)
  • We could use cross-validation for hyperparameter tuning rather than relying on a single split. The KFold class from sklearn could be useful
  • The contest also provides an unlabeled data set; we could build a meaningful representation of it with a Word2Vec model (second attempt)
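
A minimal sketch of the hashing trick from the first bullet (the parameter values are illustrative):

from sklearn.feature_extraction.text import HashingVectorizer

#fixed-size hashed feature space; no vocabulary is stored in memory
hasher = HashingVectorizer(stop_words='english', n_features=2**18)
hashed_features = hasher.transform(corpus_preprocessing(train_data.review[0:5]))
print(hashed_features.shape)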